AITopics | constrained policy optimization

Collaborating Authors

constrained policy optimization

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Value-at-Risk Constrained Policy Optimization

Tangri, Rohan, Calliess, Jan-Peter

arXiv.org Machine LearningFeb-2-2026

We introduce the Value-at-Risk Constrained Policy Optimization algorithm (VaR-CPO), a sample efficient and conservative method designed to optimize Value-at-Risk (VaR) constraints directly. Empirically, we demonstrate that VaR-CPO is capable of safe exploration, achieving zero constraint violations during training in feasible environments, a critical property that baseline methods fail to uphold. To overcome the inherent non-differentiability of the VaR constraint, we employ the one-sided Chebyshev inequality to obtain a tractable surrogate based on the first two moments of the cost return. Additionally, by extending the trust-region framework of the Constrained Policy Optimization (CPO) method, we provide rigorous worst-case bounds for both policy improvement and constraint violation during the training process.

constraint, machine learning, reinforcement learning, (18 more...)

arXiv.org Machine Learning

2601.22993

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Add feedback

Constrained Policy Optimization with Explicit Behavior Density For Offline Reinforcement Learning

Neural Information Processing SystemsDec-23-2025, 22:59:22 GMT

Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Existing methods for addressing this issue either control policy to exclude the OOD action or make the $Q$ function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and enable exploration within the region, resulting in less conservative learning policies. We further provide theoretical results for both the flow-GAN estimator and performance guarantee for CPED by showing that CPED can find the optimal $Q$-function value. Empirically, CPED outperforms existing alternatives on various standard offline reinforcement learning tasks, yielding higher expected returns.

constrained policy optimization, explicit behavior density, offline reinforcement learning, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)

Add feedback

Conformal Constrained Policy Optimization for Cost-Effective LLM Agents

Si, Wenwen, Jang, Sooyong, Lee, Insup, Bastani, Osbert

arXiv.org Artificial IntelligenceNov-18-2025

While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy where we combine multiple LLM models with varying cost/accuracy tradeoffs in an agentic manner, where models and tools are run in sequence as determined by an orchestration model to minimize cost subject to a user-specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off-policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost-aware policy (score function) and an adaptive threshold. Across two multi-hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost-aware baselines and LLM-guided methods without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost-effective while maintaining reliability.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.11828

Country: Asia (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Constrained Policy Optimization with Explicit Behavior Density For Offline Reinforcement Learning

Neural Information Processing SystemsOct-9-2024, 20:12:15 GMT

Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Existing methods for addressing this issue either control policy to exclude the OOD action or make the Q function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and enable exploration within the region, resulting in less conservative learning policies. We further provide theoretical results for both the flow-GAN estimator and performance guarantee for CPED by showing that CPED can find the optimal Q -function value.

constrained policy optimization, explicit behavior density, offline reinforcement learning, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)

Add feedback

Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization

Xu, Haotian, Wang, Shengjie, Wang, Zhaolei, Zhang, Yunzhe, Zhuo, Qing, Gao, Yang, Zhang, Tao

arXiv.org Artificial IntelligenceJul-27-2023

Reinforcement learning (RL) has achieved promising results on most robotic control tasks. Safety of learning-based controllers is an essential notion of ensuring the effectiveness of the controllers. Current methods adopt whole consistency constraints during the training, thus resulting in inefficient exploration in the early stage. In this paper, we propose an algorithm named Constrained Policy Optimization with Extra Safety Budget (ESB-CPO) to strike a balance between the exploration efficiency and the constraints satisfaction. In the early stage, our method loosens the practical constraints of unsafe transitions (adding extra safety budget) with the aid of a new metric we propose. With the training process, the constraints in our optimization problem become tighter. Meanwhile, theoretical analysis and practical experiments demonstrate that our method gradually meets the cost limit's demand in the final training stage. When evaluated on Safety-Gym and Bullet-Safety-Gym benchmarks, our method has shown its advantages over baseline algorithms in terms of safety and optimality. Remarkably, our method gains remarkable performance improvement under the same cost limit compared with baselines.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

arXiv.org Artificial Intelligence

2302.14339

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.34)

Add feedback

Locally Constrained Policy Optimization for Online Reinforcement Learning in Non-Stationary Input-Driven Environments

Hamadanian, Pouya, Nasr-Esfahany, Arash, Sen, Siddartha, Schwarzkopf, Malte, Alizadeh, Mohammad

arXiv.org Artificial IntelligenceFeb-4-2023

We study online Reinforcement Learning (RL) in non-stationary input-driven environments, where a time-varying exogenous input process affects the environment dynamics. Online RL is challenging in such environments due to catastrophic forgetting (CF). The agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigate this issue assume task labels (which are often not available in practice) or use off-policy methods that can suffer from instability and poor performance. We present Locally Constrained Policy Optimization (LCPO), an on-policy RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside of the current input distribution. We evaluate LCPO in two gym and computer systems environments with a variety of synthetic and real input traces, and find that it outperforms state-of-the-art on-policy and off-policy RL methods in the online setting, while achieving results on-par with an offline agent pre-trained on the whole input trace.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2302.02182

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.28)
North America > United States > New York > New York County > New York City (0.14)
Asia > Middle East > Jordan (0.04)
(5 more...)

Genre:

Research Report (0.83)
Instructional Material > Online (0.61)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Distributional constrained reinforcement learning for supply chain optimization

Bermúdez, Jaime Sabal, Chanona, Antonio del Rio, Tsay, Calvin

arXiv.org Artificial IntelligenceFeb-3-2023

This work studies reinforcement learning (RL) in the context of multi-period supply chains subject to constraints, e.g., on production and inventory. We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in RL. Our approach is based on Constrained Policy Optimization (CPO), which is subject to approximation errors that in practice lead it to converge to infeasible policies. We address this issue by incorporating aspects of distributional RL into DCPO. Specifically, we represent the return and cost value functions using neural networks that output discrete distributions, and we reshape costs based on the associated confidence. Using a supply chain case study, we show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training. The proposed method also improves predictability, greatly reducing the variance of returns between runs, respectively; this result is significant in the context of policy gradient methods, which intrinsically introduce significant variance during training.

constraint, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2302.01727

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Netherlands > South Holland > Delft (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Wen, Lu, Duan, Jingliang, Li, Shengbo Eben, Xu, Shaobing, Peng, Huei

arXiv.org Artificial IntelligenceMar-2-2020

Reinforcement learning (RL) is attracting increasing interests in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles for two predominant problems: behaviours are unexplainable, and they cannot guarantee safety under new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks are used to approximate the policy function, value function and a newly added risk function, respectively. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of safety constrained problems, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy-update. The simulations of two scenarios for autonomous vehicles confirm we can ensure safety while achieving fast learning.

algorithm, constraint, vehicle, (13 more...)

arXiv.org Artificial Intelligence

2003.01303

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
Asia > China > Beijing > Beijing (0.05)
North America > United States > Minnesota > Ramsey County > Saint Paul (0.04)
(5 more...)

Genre: Research Report (0.64)

Industry:

Automobiles & Trucks (1.00)
Transportation > Ground > Road (0.55)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback